In this project I plan to work with a dataset called “Students Mental Health Assessments.” This dataset contains a range of information about students, including their age, course of study, gender, stress levels, depression scores, anxiety scores, and more. It’s a comprehensive dataset that can provide insights into the mental health of students. This dataset is collected from Kaggle.Lateron to get better analysis did choose another dataset as well that information is later on in this report

##Reading Data(Importing Data)

# Assuming your file is named "students_mental_health_survey.csv"
file_path <- "students_mental_health_survey.csv"

# Read the CSV file without showing column types
Student_Data <- read_csv(file_path, show_col_types = FALSE)

# Print information about the data
print(Student_Data)
## # A tibble: 7,022 × 20
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    25 Others         Male    3.56            3                3             2
##  2    24 Engineering    Female  2.44            0                3             0
##  3    19 Business       Female  3.74            4                0             3
##  4    19 Computer Scie… Male    3.65            2                1             0
##  5    18 Business       Male    3.4             3                3             4
##  6    21 Medical        Female  3.35            2                4             3
##  7    18 Law            Male    3.65            2                2             5
##  8    21 Business       Female  3.4             0                3             3
##  9    24 Medical        Male    3.8             3                2             1
## 10    19 Engineering    Female  3.05            2                5             0
## # ℹ 7,012 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>
# Print the number of rows and columns
cat("Number of rows:", nrow(Student_Data), "\n")
## Number of rows: 7022
cat("Number of columns:", ncol(Student_Data), "\n")
## Number of columns: 20

Tidying the Data

Handle Missing Values

# Replace empty strings with NA
Student_Data[Student_Data == ""] <- NA

# Remove rows with missing values
Student_Data_without_missing_data <- na.omit(Student_Data)

# Check the dimensions of the cleaned data
dim(Student_Data_without_missing_data)
## [1] 6995   20

#Printing number of rows and columns after Dropping the missing Data

Student_Data_without_missing_data
## # A tibble: 6,995 × 20
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    25 Others         Male    3.56            3                3             2
##  2    24 Engineering    Female  2.44            0                3             0
##  3    19 Business       Female  3.74            4                0             3
##  4    18 Business       Male    3.4             3                3             4
##  5    21 Medical        Female  3.35            2                4             3
##  6    18 Law            Male    3.65            2                2             5
##  7    21 Business       Female  3.4             0                3             3
##  8    24 Medical        Male    3.8             3                2             1
##  9    22 Computer Scie… Male    3.19            1                1             3
## 10    27 Medical        Male    3.26            3                2             2
## # ℹ 6,985 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>
cat("Number of rows:", nrow(Student_Data_without_missing_data ), "\n")
## Number of rows: 6995
cat("Number of columns:", ncol(Student_Data_without_missing_data ), "\n")
## Number of columns: 20

Exploring factors influencing stress levels.

I want to know what are the factors that are effecting the stress levels, so drawn plots stress level against remaining (19)features. For easy visualization I choose scatter plots for the continuous variables and box plot for categorical variables

# Scatter Plot of Stress Level and CGPA
scatter_plot_cgpa <- ggplot(Student_Data_without_missing_data, aes(x = CGPA, y = Stress_Level,colour=Course)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and CGPA", x = "CGPA", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
polar_scater_plot<-ggplot(Student_Data_without_missing_data, aes(x = CGPA, y = Stress_Level, color = CGPA)) +
  geom_point() +
  labs(title = "Polar Scatter Plot of Stress Level by CGPA", x = "CGPA", y = "Stress_Level") +
  theme(plot.title = element_text(hjust = 0.5)) +  # Center the title
  coord_polar() +
  scale_color_viridis_c()
# Display plot
print(scatter_plot_cgpa)

print(polar_scater_plot)

The plots depicts the relationship between stress level and CGPA, with the CGPA ranging from 2.5 to 4.0 and stress levels on a discrete scale. The plot shows that individuals with similar CGPAs can have varying stress levels, indicating no clear trend between CGPA and stress. Data points are densely populated between CGPAs of 3.0 to 3.5. The absence of a visible pattern suggests that other factors may influence the relationship between stress and academic performance.

#Scatter Plot of Stress Level and Age
scatter_plot_age <- ggplot(Student_Data_without_missing_data, aes(x = Age, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and Age", x = "Age", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(scatter_plot_age)

The scatter plot shows the relationship between stress level and age. Stress level is plotted on the vertical axis, ranging from 0 to 5, and age is on the horizontal axis, ranging from about 20 to 35 years. The plot presents a uniform distribution of stress levels across different ages with no apparent pattern, indicating that within this dataset, age does not seem to significantly affect reported stress levels. Each horizontal line of dots represents all the reported stress levels for a particular age, and every age has a full range of stress levels.

#Scatter Plot of Stress Level and Anxiety Score
scatter_plot_Anxiety_Score <- ggplot(Student_Data_without_missing_data, aes(x = Anxiety_Score, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and Anxiety Score", x = "Anxiety_Score", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5))  # Center the title

# Display plot
print(scatter_plot_Anxiety_Score)

The scatter plot illustrates the relationship between stress level (vertical axis) and anxiety score (horizontal axis), both seemingly rated on a scale from 0 to 5. Similar to the previous plots, there’s a uniform distribution with each level of anxiety having a full range of stress levels associated with it. This uniformity indicates no clear correlation between the two variables in the data presented. Each point likely represents an individual’s reported stress level at a given anxiety score, and the plot suggests that an individual’s stress level can vary widely regardless of their anxiety score.

#Scatter Plot of Stress Level and Depression Score
scatter_plot_Depression_Score <- ggplot(Student_Data_without_missing_data, aes(x = Depression_Score, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and Depression Score", x = "Depression_Score", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

# Display plot
print(scatter_plot_Depression_Score)

The scatter plot shows the relationship between stress level and depression score, with both variables appearing to be rated on a scale from 0 to 5. The plot displays a uniform distribution of stress levels across the range of depression scores, indicating no clear pattern or correlation within this dataset. Each depression score has associated with it a full span of stress levels, suggesting that stress levels are independent of depression scores in this sample. Each point likely represents an individual’s stress level for their respective depression score.

# Boxplot of Stress Level by Gender
boxplot_gender <- ggplot(Student_Data_without_missing_data, aes(x = Gender, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Gender", x = "Gender", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

# Display plot
print(boxplot_gender)

The boxplot compares stress levels between two gender groups: female and male. The stress level, likely on a scale from 0 to 5, is represented on the vertical axis. Both boxplots have a median stress level around 2, indicated by the line within the box. The range of stress levels (interquartile range) for both genders is similar, spanning from about 2 to 4, with the ‘whiskers’ suggesting that there are individuals who report stress levels slightly outside this range. There are no visible outliers, and the distribution seems quite symmetrical for both genders, suggesting a similar variation in stress levels among females and males within this dataset.

# Boxplot of Stress Level by Course
boxplot_course <- ggplot(Student_Data_without_missing_data, aes(x = Course, y = Stress_Level, fill = Course)) +
  geom_boxplot() +
  labs(title = "Plot of Stress Level by Course", x = "Course", y = "Stress_Level") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

# Display the plot
print(boxplot_course)

boxplot_course+coord_polar()

The plot displays stress levels by academic course, with medians around 2 for all fields. Variability in stress is similar for Business, Computer Science, Law, Engineering and Others. The Medical field has a marginally higher median stress level, but overall, the stress level distribution is consistent across courses without any outliers.

#Boxplot of Stress level by Sleep Quality
boxplot_sleep_quality <- ggplot(Student_Data_without_missing_data, aes(x = Sleep_Quality, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Sleep Quality", x = "Sleep_Quality", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_sleep_quality)

The boxplot illustrates stress levels categorized by sleep quality, which is divided into ‘Average’, ‘Good’, and ‘Poor’. The stress level is consistent across all sleep quality categories with a median around 2. The interquartile ranges are also similar, indicating that the spread of stress levels is relatively uniform regardless of sleep quality. There are no outliers or extreme values, suggesting that within this dataset, perceived sleep quality does not show a distinct impact on stress levels.

#Diet_Quality
boxplot_Diet_Quality <- ggplot(Student_Data_without_missing_data, aes(x = Diet_Quality, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Diet Quality", x = "Diet_Quality", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Diet_Quality)

The boxplot shows stress levels in relation to diet quality, categorized as ‘Average’, ‘Good’, and ‘Poor’. Stress levels are on a scale on the vertical axis, and the median stress level is consistent across all diet categories, situated around the 3 mark. The interquartile ranges, which represent the middle 50% of responses, are similar for all three diet qualities, showing no significant variation in stress levels with diet quality. No outliers are present, indicating a uniform distribution of stress levels across the dietary spectrum in the dataset.

#Boxplot of Stress Level by Social_Support
boxplot_Social_Support <- ggplot(Student_Data_without_missing_data, aes(x = Social_Support, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Social Support", x = "Social_Support", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Social_Support)

The boxplot displays stress levels in relation to social support, categorized as ‘High’, ‘Low’, and ‘Moderate’. The median stress level appears consistent across all categories, near the value of 2. The interquartile ranges are similar, indicating a comparable spread of stress levels regardless of social support level. There are no outliers or extreme values shown, which suggests that within this dataset, the perceived level of social support does not have a distinct impact on the reported stress levels.

#Boxplot of Stress Level by Relationship_Status
boxplot_Relationship_Status <- ggplot(Student_Data_without_missing_data, aes(x = Relationship_Status, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Relationship Status", x = "Relationship_Status", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title


# Display plot
print(boxplot_Relationship_Status)

The boxplot compares stress levels by relationship status: ‘In a Relationship’, ‘Married’, and ‘Single’. Stress levels are indicated on the vertical axis and all three categories show a median stress level close to 2. The interquartile ranges, representing the middle 50% of data, are similar across the categories, suggesting a comparable spread of stress levels regardless of relationship status. The ‘whiskers’ of the plot indicate the range of the data within 1.5 times the interquartile range, and they appear to be similar for all groups. There are no outliers indicated, which suggests that there is no significant variation in stress levels that can be attributed to relationship status within this dataset.

#Boxplot of Stress Level by Physical_Activity
boxplot_Physical_Activity <- ggplot(Student_Data_without_missing_data, aes(x = Physical_Activity, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Physical Activity", x = "Physical_Activity", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Physical_Activity)

The boxplot depicts stress levels based on physical activity, categorized as ‘High’, ‘Low’, and ‘Moderate’. The median stress level for each category is approximately at the 2 mark on the vertical axis. The spread of stress levels, as indicated by the interquartile ranges, is consistent across all levels of physical activity. There are no outliers shown, suggesting a uniform distribution of stress levels within each physical activity group in this dataset.

#Boxplot of Stress Level by Substance_Use
boxplot_Substance_Use <- ggplot(Student_Data_without_missing_data, aes(x = Substance_Use, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Substance Use", x = "Substance_Use", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Substance_Use)

The boxplot illustrates stress levels in relation to substance use, categorized as ‘Frequently’, ‘Never’, and ‘Occasionally’. The median stress level for ‘Never’ and ‘Occasionally’ is about the same, while ‘Frequently’ is slightly higher. The range of stress levels, indicated by the boxes, is broader for those who ‘Never’ use substances, with ‘Frequently’ showing the narrowest interquartile range. Notably, there are outliers in the ‘Frequently’ category, indicating individuals with extremely low stress levels, unlike the other two categories. This suggests some variations in stress levels associated with substance use frequency.

#Boxplot of Stress Level by Counseling_Service_Use
boxplot_Counseling_Service_Use= ggplot(Student_Data_without_missing_data, aes(x = Counseling_Service_Use, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Counseling Service Use", x = "Counseling_Service_Use", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Counseling_Service_Use)

The boxplot depicts stress levels in relation to the use of counseling services, categorized as ‘Frequently’, ‘Never’, and ‘Occasionally’. The median stress level is around 2 for those who ‘Never’ or ‘Occasionally’ use counseling services, and slightly higher for those who ‘Frequently’ use them. The interquartile ranges are comparable across all categories, suggesting a similar spread of stress levels irrespective of counseling service usage. The ‘whiskers’ extend to show the full range of reported stress levels, with no significant outliers present, which indicates a relatively consistent distribution of stress levels across the frequency of counseling service use within this dataset.

#Boxplot of Stress Level by Family_History
boxplot_Family_History <- ggplot(Student_Data_without_missing_data, aes(x = Family_History, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Family History", x = "Family_History", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Family_History)

The boxplot shows stress levels categorized by family history of mental health issues, with ‘No’ on the left and ‘Yes’ on the right. The median stress level for those with no family history is around 2.5, and for those with a family history, it is slightly higher, near 3. The interquartile range is slightly wider for those with a family history, indicating a greater spread in stress levels. Both categories have a similar overall range of stress levels, as indicated by the ‘whiskers’. There are no significant outliers in either category, suggesting that while there may be a slight difference in median stress levels, the overall distribution is fairly consistent regardless of family history.

#Boxplot of Stress Level by Chronic_Illness
boxplot_Chronic_Illness <- ggplot(Student_Data_without_missing_data, aes(x = Chronic_Illness, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Chronic Illness", x = "Chronic_Illness", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(boxplot_Chronic_Illness)

The boxplot compares stress levels based on the presence of a chronic illness, with categories ‘No’ and ‘Yes’. Those without a chronic illness have a median stress level slightly above 2, while those with a chronic illness have a median stress level near 3. The interquartile ranges, indicating the middle 50% of data, are similar for both groups, suggesting a comparable variability in stress levels. The ‘whiskers’ show the full range of stress levels within each category, with no significant outliers. This indicates that while the median stress level is slightly higher for those with a chronic illness, the overall distribution of stress levels is relatively consistent between the two groups.

#Scatter Plot of Stress Level and Financial_Stress
scatter_Financial_Stress <- ggplot(Student_Data_without_missing_data, aes(x = Financial_Stress, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and Financial Stress", x = "Financial_Stress", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(scatter_Financial_Stress)

The scatter plot illustrates the relationship between stress level and financial stress, with financial stress on the horizontal axis and stress level on the vertical axis, both seemingly rated on a scale from 0 to 5. The plot shows that for each level of financial stress, there is a full range of reported stress levels, suggesting that within this dataset, there is no clear trend indicating that financial stress directly correlates with overall stress level. Individuals report a wide variety of stress levels at each point of financial stress, indicating variability in the impact of financial stress on overall stress levels.

#Boxplot of Stress Level by Extracurricular_Involvement
boxplot_Extracurricular_Involvement <- ggplot(Student_Data_without_missing_data, aes(x = Extracurricular_Involvement, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Extracurricular Involvement", x = "Extracurricular_Involvement", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plots
print(boxplot_Extracurricular_Involvement)

The boxplot compares stress levels based on the extent of extracurricular involvement, with categories ‘High’, ‘Low’, and ‘Moderate’. Two categories have similar median stress levels, around the 2 mark,while High category have median around 3. The ‘whiskers’ of the plot extend to the full range of reported stress levels within each category, and no outliers are present. This indicates that within this dataset, the level of extracurricular involvement does show a distinct impact on reported stress levels.

#Scatter Plot of Stress level and Semester_Credit_Load
scatter_Semester_Credit_Load <- ggplot(Student_Data_without_missing_data, aes(x = Semester_Credit_Load, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress level and Semester Credit Load", x = "Semester_Credit_Load", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plot
print(scatter_Semester_Credit_Load)

The scatter plot displays the relationship between stress level and semester credit load. Stress level is on the vertical axis, and semester credit load is on the horizontal axis, ranging from approximately 15 credits. Each dot represents an individual’s reported stress level at a particular credit load.The plot shows a uniform distribution across different credit loads, with each level of credit load having a full range of stress levels from 0 to 5. This indicates no clear correlation between the number of credits a student is taking and their reported stress level. Students with low, moderate, and high credit loads report a similar range of stress levels, suggesting that factors other than credit load alone may influence stress levels.

#Boxplot of Stress Level by Extracurricular_Involvement
boxplot_Residence_Type <- ggplot(Student_Data_without_missing_data, aes(x = Residence_Type, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Residence Type", x = "Residence_Type", y = "Stress_Level")+
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
# Display plots
print(boxplot_Residence_Type)

The boxplot illustrates stress levels among individuals based on their type of residence: ‘Off-Campus’, ‘On-Campus’, and ‘With Family’. The median stress level for off-campus and on-campus residence types is similar, hovering with family marks 3. The ‘Off-Campus’ and ‘On-Campus’ categories have comparable interquartile ranges. The whiskers, which represent the range of the data, extend similarly across all types, and no significant outliers are present. This indicates that residence type do have a major impact on stress levels within this dataset.

From the results I observed that these are the factors namely:Course,Substance Use,Counseling service used,Family history,Chronic illness,Extracurricular Involvement,Residence type effect the stress levels in student (This is based on the dataset I am using )

##Data Transformation

Using filter() to filter the data so that I will have the data that has only effects the stress levels .

filtered_data <- Student_Data_without_missing_data %>%
  filter(
    Substance_Use=="Frequently" |
    Counseling_Service_Use=="Frequently" |
    Family_History=="Yes" |
    Chronic_Illness=="Yes" |
    Extracurricular_Involvement=="High" |
    Residence_Type=="With Family"
  )
filtered_data 
## # A tibble: 4,498 × 20
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    19 Business       Female  3.74            4                0             3
##  2    18 Business       Male    3.4             3                3             4
##  3    21 Business       Female  3.4             0                3             3
##  4    24 Medical        Male    3.8             3                2             1
##  5    22 Computer Scie… Male    3.19            1                1             3
##  6    27 Medical        Male    3.26            3                2             2
##  7    25 Law            Male    3.61            3                1             5
##  8    18 Medical        Female  3.85            4                1             3
##  9    19 Medical        Male    3.26            5                1             1
## 10    22 Computer Scie… Male    3.46            3                1             0
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>

I wanted to have the data that only have the specific features but when I used “&” it resulted as 0 rows. so went ahead and used “|” operator so that i will have the data that cummulatively will have the features that I desire to examine. filtered_data is the new data that has filtered data

#Arrange the data based on CGPA in descending order

arranged_data <- arrange(filtered_data, desc(CGPA))
arranged_data
## # A tibble: 4,498 × 20
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    19 Law            Male       4            5                0             1
##  2    24 Business       Female     4            2                1             0
##  3    28 Law            Female     4            0                1             4
##  4    25 Computer Scie… Female     4            0                3             3
##  5    25 Law            Male       4            1                4             0
##  6    27 Engineering    Female     4            4                4             2
##  7    21 Computer Scie… Male       4            1                1             2
##  8    20 Law            Male       4            2                2             1
##  9    18 Engineering    Male       4            3                0             3
## 10    20 Law            Male       4            4                2             3
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>

arranged_data is the data that have the CGPA arranged in Descending Order.But it is changing the cgpa values as from the filtered_data, I see that the CGPA varies from 4.00 - 2.60 in filtered_data but in the arranged_data the CGPA varies from 4.00-3.73.

arranged_data_continuous <- arrange(filtered_data,Course)
arranged_data_continuous
## # A tibble: 4,498 × 20
##      Age Course   Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>    <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    19 Business Female  3.74            4                0             3
##  2    18 Business Male    3.4             3                3             4
##  3    21 Business Female  3.4             0                3             3
##  4    23 Business Female  3.7             0                1             2
##  5    24 Business Female  4               2                1             0
##  6    23 Business Male    3.75            3                2             2
##  7    18 Business Female  3.71            0                1             2
##  8    28 Business Female  3.12            4                1             2
##  9    24 Business Male    3.64            2                3             1
## 10    29 Business Female  3.12            1                5             2
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>

Even I checked arrange() with continuous variable “Course” even here it is giving the courses only Business and Computer Science. I am not sure what is wrong for now.I tried the way Dr. Swan still unable to get the correct format.

#Arranging columns with select() select() pulls out only the named columns. Some datasets come with a huge number of variables, and it can be helpful to only focus on those of immediate interest:

select(filtered_data, "Course",
  "Stress_Level",
  "Substance_Use",
  "Counseling_Service_Use",
  "Family_History",
  "Chronic_Illness",
  "Extracurricular_Involvement",
  "Residence_Type"
)
## # A tibble: 4,498 × 8
##    Course       Stress_Level Substance_Use Counseling_Service_Use Family_History
##    <chr>               <dbl> <chr>         <chr>                  <chr>         
##  1 Business                4 Never         Occasionally           No            
##  2 Business                3 Never         Never                  No            
##  3 Business                0 Never         Never                  Yes           
##  4 Medical                 3 Frequently    Never                  Yes           
##  5 Computer Sc…            1 Never         Occasionally           No            
##  6 Medical                 3 Never         Occasionally           No            
##  7 Law                     3 Never         Never                  Yes           
##  8 Medical                 4 Never         Never                  Yes           
##  9 Medical                 5 Never         Never                  Yes           
## 10 Computer Sc…            3 Never         Frequently             No            
## # ℹ 4,488 more rows
## # ℹ 3 more variables: Chronic_Illness <chr>, Extracurricular_Involvement <chr>,
## #   Residence_Type <chr>

Renaming the variables using rename()

rename(filtered_data, Course = Course, Stress_Points = Stress_Level,Credit_Load=Semester_Credit_Load,Counseling_Service =Counseling_Service_Use)
## # A tibble: 4,498 × 20
##      Age Course        Gender  CGPA Stress_Points Depression_Score Anxiety_Score
##    <dbl> <chr>         <chr>  <dbl>         <dbl>            <dbl>         <dbl>
##  1    19 Business      Female  3.74             4                0             3
##  2    18 Business      Male    3.4              3                3             4
##  3    21 Business      Female  3.4              0                3             3
##  4    24 Medical       Male    3.8              3                2             1
##  5    22 Computer Sci… Male    3.19             1                1             3
##  6    27 Medical       Male    3.26             3                2             2
##  7    25 Law           Male    3.61             3                1             5
##  8    18 Medical       Female  3.85             4                1             3
##  9    19 Medical       Male    3.26             5                1             1
## 10    22 Computer Sci… Male    3.46             3                1             0
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Credit_Load <dbl>, Residence_Type <chr>

#Adding new variables with mutate()

mutated_data <- filtered_data %>% mutate(Total_Score = Depression_Score + Anxiety_Score)
mutated_data
## # A tibble: 4,498 × 21
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    19 Business       Female  3.74            4                0             3
##  2    18 Business       Male    3.4             3                3             4
##  3    21 Business       Female  3.4             0                3             3
##  4    24 Medical        Male    3.8             3                2             1
##  5    22 Computer Scie… Male    3.19            1                1             3
##  6    27 Medical        Male    3.26            3                2             2
##  7    25 Law            Male    3.61            3                1             5
##  8    18 Medical        Female  3.85            4                1             3
##  9    19 Medical        Male    3.26            5                1             1
## 10    22 Computer Scie… Male    3.46            3                1             0
## # ℹ 4,488 more rows
## # ℹ 14 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>, Total_Score <dbl>

As after mutating there are 2 rows combined to one so I an going to drop those columns:

# Drop multiple columns
After_Deleting_Columns<- select(mutated_data, -Depression_Score,-Anxiety_Score)
After_Deleting_Columns
## # A tibble: 4,498 × 19
##      Age Course        Gender  CGPA Stress_Level Sleep_Quality Physical_Activity
##    <dbl> <chr>         <chr>  <dbl>        <dbl> <chr>         <chr>            
##  1    19 Business      Female  3.74            4 Good          Low              
##  2    18 Business      Male    3.4             3 Good          Low              
##  3    21 Business      Female  3.4             0 Average       Low              
##  4    24 Medical       Male    3.8             3 Poor          Low              
##  5    22 Computer Sci… Male    3.19            1 Average       Moderate         
##  6    27 Medical       Male    3.26            3 Average       Moderate         
##  7    25 Law           Male    3.61            3 Good          Low              
##  8    18 Medical       Female  3.85            4 Good          Low              
##  9    19 Medical       Male    3.26            5 Good          Low              
## 10    22 Computer Sci… Male    3.46            3 Good          Moderate         
## # ℹ 4,488 more rows
## # ℹ 12 more variables: Diet_Quality <chr>, Social_Support <chr>,
## #   Relationship_Status <chr>, Substance_Use <chr>,
## #   Counseling_Service_Use <chr>, Family_History <chr>, Chronic_Illness <chr>,
## #   Financial_Stress <dbl>, Extracurricular_Involvement <chr>,
## #   Semester_Credit_Load <dbl>, Residence_Type <chr>, Total_Score <dbl>

##After filtering and mutating visualizing the relation among stresslevel and other factors

# Explore factors influencing stress levels

# Scatter plot for numerical variables
facet_grid_plot <- ggplot(data = After_Deleting_Columns) +
  geom_point(mapping = aes(x = CGPA, y = Stress_Level, color = Course)) +
  facet_grid(rows = vars(Gender), cols = vars(Course), labeller = label_both)


facet_grid<-ggplot(data = Student_Data) +
  geom_point(mapping = aes(x = CGPA, y = Stress_Level, color = Gender)) +
  facet_grid(cols = vars(Gender))



# Boxplot for categorical variables
boxplot_gender <- ggplot(After_Deleting_Columns, aes(x = Gender, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Gender", x = "Gender", y = "Stress Level")

boxplot_course <- ggplot(After_Deleting_Columns, aes(x = Course, y = Stress_Level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Stress Level by Course", x = "Course", y = "Stress Level") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Display plots
print(facet_grid_plot)

print(facet_grid)
## Warning: Removed 12 rows containing missing values (`geom_point()`).

print(boxplot_gender)

print(boxplot_course)

Still I find no difference behaviour of variables.

#Finding Average of CGPA

# Summarize the data to calculate average CGPA
avg_cgpa <- After_Deleting_Columns %>% summarize(Avg_CGPA = mean(CGPA, na.rm = TRUE))
avg_cgpa
## # A tibble: 1 × 1
##   Avg_CGPA
##      <dbl>
## 1     3.49

The average CGPA in the dataset is approximately 3.49.This suggests a moderate overall academic performance among the students. The CGPA is a key metric reflecting students’ academic success.

avg_stress_level <-After_Deleting_Columns%>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
avg_stress_level 
## # A tibble: 1 × 1
##   Avg_Stress
##        <dbl>
## 1       2.46

The average stress level in the dataset is around 2.46.This indicates a moderate level of stress among the students.Stress level is an essential metric reflecting the mental well-being of the students.

Group by Gender and calculate average CGPA

grouped_data_by_gender_cgpa <- mutated_data %>% group_by(Gender) %>% summarize(Avg_CGPA = mean(CGPA, na.rm = TRUE))
grouped_data_by_gender_cgpa
## # A tibble: 2 × 2
##   Gender Avg_CGPA
##   <chr>     <dbl>
## 1 Female     3.49
## 2 Male       3.48

The data suggests a marginal difference in CGPA between female and male students

grouped_data_by_gender<- mutated_data %>% group_by(Gender) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_gender
## # A tibble: 2 × 2
##   Gender Avg_Stress
##   <chr>       <dbl>
## 1 Female       2.44
## 2 Male         2.47

The data suggests a small difference in stress levels between female and male students.

grouped_data_by_Sleep_Quality <- mutated_data %>% group_by(Sleep_Quality) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Sleep_Quality
## # A tibble: 3 × 2
##   Sleep_Quality Avg_Stress
##   <chr>              <dbl>
## 1 Average             2.47
## 2 Good                2.45
## 3 Poor                2.44

Students with an “Average” sleep quality have an average stress level of approximately 2.47. Those with a “Good” sleep quality have a slightly lower average stress level of around 2.45. Students with a “Poor” sleep quality exhibit the lowest average stress level of approximately 2.44

grouped_data_by_Course <- mutated_data %>% group_by(Course) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Course
## # A tibble: 6 × 2
##   Course           Avg_Stress
##   <chr>                 <dbl>
## 1 Business               2.14
## 2 Computer Science       2.08
## 3 Engineering            2.15
## 4 Law                    2.19
## 5 Medical                3.23
## 6 Others                 2.08

From the table it is observed that average stress is more for the Medical students

grouped_data_by_Chronic_Illness <- mutated_data %>% group_by(Chronic_Illness) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Chronic_Illness
## # A tibble: 2 × 2
##   Chronic_Illness Avg_Stress
##   <chr>                <dbl>
## 1 No                    2.45
## 2 Yes                   2.56

The students with Chronic_Illness is 0.1 are having more stress level compared to the students that don’t have

grouped_data_by_Residence_Type <- mutated_data %>% group_by(Residence_Type) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Residence_Type
## # A tibble: 3 × 2
##   Residence_Type Avg_Stress
##   <chr>               <dbl>
## 1 Off-Campus           2.47
## 2 On-Campus            2.42
## 3 With Family          2.49

From plots when it is observed it should the Residence_Type of with family have more stress levels but from the average values it is clear that the off-campus and with family students are similar stress level

grouped_data_by_Extracurricular_Involvement <- mutated_data %>% group_by(Extracurricular_Involvement) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Extracurricular_Involvement
## # A tibble: 3 × 2
##   Extracurricular_Involvement Avg_Stress
##   <chr>                            <dbl>
## 1 High                              2.45
## 2 Low                               2.46
## 3 Moderate                          2.46

I am shocked by seeing the average stress values with respect to Extracurricular_Involvement as from the plots it displays that people with high Extracurricular_Involvement has more stress level . I guess as we are seeing the average of stress_level ,the results are differing.

grouped_data_by_Substance_Use <- mutated_data %>% group_by(Substance_Use) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Substance_Use
## # A tibble: 3 × 2
##   Substance_Use Avg_Stress
##   <chr>              <dbl>
## 1 Frequently          2.44
## 2 Never               2.46
## 3 Occasionally        2.46

Here there is slight that is by 0.01 is difference between than avg stress level among Substance use

grouped_data_by_Counseling_Service_Use <- mutated_data %>% group_by(Counseling_Service_Use) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Counseling_Service_Use
## # A tibble: 3 × 2
##   Counseling_Service_Use Avg_Stress
##   <chr>                       <dbl>
## 1 Frequently                   2.54
## 2 Never                        2.43
## 3 Occasionally                 2.46

Here the students that have frequently used counselling service have 0.1 more average stress level

grouped_data_by_Family_History <- mutated_data %>% group_by(Family_History) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Family_History
## # A tibble: 2 × 2
##   Family_History Avg_Stress
##   <chr>               <dbl>
## 1 No                   2.44
## 2 Yes                  2.48

The students with family history have 0.04 more average stress level

mutated_data
## # A tibble: 4,498 × 21
##      Age Course         Gender  CGPA Stress_Level Depression_Score Anxiety_Score
##    <dbl> <chr>          <chr>  <dbl>        <dbl>            <dbl>         <dbl>
##  1    19 Business       Female  3.74            4                0             3
##  2    18 Business       Male    3.4             3                3             4
##  3    21 Business       Female  3.4             0                3             3
##  4    24 Medical        Male    3.8             3                2             1
##  5    22 Computer Scie… Male    3.19            1                1             3
##  6    27 Medical        Male    3.26            3                2             2
##  7    25 Law            Male    3.61            3                1             5
##  8    18 Medical        Female  3.85            4                1             3
##  9    19 Medical        Male    3.26            5                1             1
## 10    22 Computer Scie… Male    3.46            3                1             0
## # ℹ 4,488 more rows
## # ℹ 14 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## #   Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## #   Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## #   Chronic_Illness <chr>, Financial_Stress <dbl>,
## #   Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## #   Residence_Type <chr>, Total_Score <dbl>

##Modeling the Data

# Create a scatter plot
scatter_plot <- ggplot(mutated_data, aes(x = CGPA, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and CGPA", x = "CGPA", y = "Stress_Level") +
  theme(plot.title = element_text(hjust = 0.5))

# Add the linear regression line
linear_model_plot <- scatter_plot +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear Model: Stress Level ~ CGPA")

# Display the plot
print(linear_model_plot)
## `geom_smooth()` using formula = 'y ~ x'

# Create a scatter plot
scatter_plot <- ggplot(mutated_data, aes(x = Total_Score, y = Stress_Level)) +
  geom_point() +
  labs(title = "Scatter Plot of Stress Level and Total_Score", x = "Total_Score", y = "Stress_Level") +
  theme(plot.title = element_text(hjust = 0.5))

# Add the linear regression line
linear_model_plot <- scatter_plot +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear Model: Stress Level ~ Total_Score")

# Display the plot
print(linear_model_plot)
## `geom_smooth()` using formula = 'y ~ x'

# Assuming linear_model is the linear model object
linear_model <- lm(Stress_Level ~ Total_Score, data = mutated_data)

# Print the summary of the linear model
summary(linear_model)
## 
## Call:
## lm(formula = Stress_Level ~ Total_Score, data = mutated_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.639 -1.439 -0.240  1.481  2.760 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.63873    0.05414  48.741  < 2e-16 ***
## Total_Score -0.03987    0.01064  -3.747 0.000181 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.633 on 4496 degrees of freedom
## Multiple R-squared:  0.003113,   Adjusted R-squared:  0.002892 
## F-statistic: 14.04 on 1 and 4496 DF,  p-value: 0.0001811

From the above outputs I think lm model is not good for my dataset as there are no features that show linear relationship. As you can see there is sifnificantly low r squred value

##Modeling using Machine Learning First using Random Forest to know what are the features are important in my data

model_rf <- randomForest(Stress_Level ~ ., data = mutated_data)
importance(model_rf)
##                             IncNodePurity
## Age                              974.3391
## Course                          1332.3462
## Gender                           216.5634
## CGPA                            1441.9538
## Depression_Score                 526.2089
## Anxiety_Score                    535.2441
## Sleep_Quality                    346.2841
## Physical_Activity                355.1167
## Diet_Quality                     338.6833
## Social_Support                   355.4438
## Relationship_Status              337.2928
## Substance_Use                    285.6439
## Counseling_Service_Use           353.6239
## Family_History                   210.6777
## Chronic_Illness                  124.8738
## Financial_Stress                 669.6823
## Extracurricular_Involvement      371.4175
## Semester_Credit_Load            1006.6853
## Residence_Type                   361.0957
## Total_Score                      670.1209

The variable importance scores, measured by IncNodePurity, provide insights into the features contributing significantly to stress levels in the dataset. Notably, CGPA emerges as the most influential, followed by Total_Score and Semester_Credit_Load, suggesting a strong connection between academic performance and stress. Additional impactful factors include age and course type. Interestingly, extracurricular involvement and residence type exhibit moderate importance, emphasizing the multifaceted nature of stress determinants. These insights can guide further exploration and interventions to mitigate stress, particularly among students.

Trying to do boxplot the output of the model

# Assuming you stored your random forest model in a variable called 'rf_model'
# Extract variable importance
# Extract variable importance
#importance_values <- importance(model_rf, type = 1)[, "IncNodePurity"]

# Create a data frame
#var_importance <- data.frame(
  #Variable = names(importance_values),
  #Importance = importance_values
#)

# Create a bar plot

#ggplot(var_importance, aes(x = reorder(Variable, -Importance), y = Importance)) +
  #geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  #labs(title = "Variable Importance in Random Forest Model",
      # x = "Variable",
     #  y = "Importance") +
#  theme(axis.text)

I am getting an Error in importance(model_rf, type = 1)[, “IncNodePurity”] : subscript out of bounds. Tried multiple ways to resolve it as IncNodePurity is in 0 position but R doesn’t consider 0 position when I tried giving 1 it pops up error

##Modeling My data for prediction purposes

# Step 1: Data Splitting
set.seed(123)
splitIndex <- createDataPartition(mutated_data$Stress_Level, p = 0.7, list = FALSE)
train_data <- mutated_data[splitIndex, ]
test_data <- mutated_data[-splitIndex, ]

# Step 2: Data Preprocessing (apply as needed)

# Step 3: Model Selection (multinomial logistic regression)

multinom_model <- multinom(Stress_Level ~ ., data = train_data)
## # weights:  210 (170 variable)
## initial  value 5645.834088 
## iter  10 value 5510.678407
## iter  20 value 5435.817565
## iter  30 value 5407.313965
## iter  40 value 5385.346316
## iter  50 value 5380.602475
## iter  60 value 5378.515633
## iter  70 value 5378.060746
## iter  80 value 5377.841727
## iter  90 value 5377.752102
## iter 100 value 5377.728107
## final  value 5377.728107 
## stopped after 100 iterations
# Step 4: Model Training (already done within the multinom function)

# Step 5: Model Evaluation
multinom_predictions <- predict(multinom_model, newdata = test_data)
# Check and set levels for multinom_predictions
unique_levels_multinom <- unique(multinom_predictions)
unique_levels_test <- unique(test_data$Stress_Level)

# Identify mismatched levels
missing_levels <- setdiff(unique_levels_test, unique_levels_multinom)
extra_levels <- setdiff(unique_levels_multinom, unique_levels_test)

# Remove extra levels from multinom_predictions
multinom_predictions <- factor(multinom_predictions, levels = unique_levels_multinom)

# Add missing levels to multinom_predictions
if (length(missing_levels) > 0) {
  multinom_predictions <- factor(multinom_predictions, levels = c(unique_levels_multinom, missing_levels))
}

# Remove extra levels from test_data$Stress_Level
test_data$Stress_Level <- factor(test_data$Stress_Level, levels = unique_levels_test)

# Add missing levels to test_data$Stress_Level
if (length(extra_levels) > 0) {
  test_data$Stress_Level <- factor(test_data$Stress_Level, levels = c(unique_levels_test, extra_levels))
}

# Now run confusionMatrix
confusionMatrix(multinom_predictions, test_data$Stress_Level)
## Warning in confusionMatrix.default(multinom_predictions,
## test_data$Stress_Level): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  3  1  2  4  5
##          0 45 33 31 44 21 20
##          3 39 65 49 41 43 39
##          1 52 57 63 56 29 20
##          2 62 54 56 47 20 37
##          4 13 42 17 25 56 51
##          5  5 25  8 23 31 28
## 
## Overall Statistics
##                                          
##                Accuracy : 0.2257         
##                  95% CI : (0.2036, 0.249)
##     No Information Rate : 0.2049         
##     P-Value [Acc > NIR] : 0.0328110      
##                                          
##                   Kappa : 0.0665         
##                                          
##  Mcnemar's Test P-Value : 0.0002106      
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 3 Class: 1 Class: 2 Class: 4 Class: 5
## Sensitivity           0.20833  0.23551  0.28125  0.19915  0.28000  0.14359
## Specificity           0.86826  0.80299  0.80944  0.79388  0.87097  0.92014
## Pos Pred Value        0.23196  0.23551  0.22744  0.17029  0.27451  0.23333
## Neg Pred Value        0.85169  0.80299  0.84953  0.82353  0.87402  0.86390
## Prevalence            0.16036  0.20490  0.16630  0.17520  0.14848  0.14477
## Detection Rate        0.03341  0.04826  0.04677  0.03489  0.04157  0.02079
## Detection Prevalence  0.14402  0.20490  0.20564  0.20490  0.15145  0.08909
## Balanced Accuracy     0.53830  0.51925  0.54534  0.49652  0.57548  0.53186

Visualizing the Confusion matrix

# Confusion Matrix
conf_matrix <- matrix(c(32, 37, 31, 33, 12, 17,
                        66, 80, 69, 78, 65, 57,
                        44, 64, 51, 56, 36, 26,
                        52, 45, 47, 30, 30, 42,
                        9, 20, 13, 15, 21, 24,
                        13, 30, 13, 24, 36, 29), nrow = 6, byrow = TRUE)

colnames(conf_matrix) <- c("0", "3", "1", "2", "4", "5")
rownames(conf_matrix) <- c("0", "3", "1", "2", "4", "5")

# Convert to data frame for ggplot
conf_df <- as.data.frame(as.table(conf_matrix))
conf_df$Var1 <- factor(conf_df$Var1, levels = c("0", "3", "1", "2", "4", "5"))
conf_df$Var2 <- factor(conf_df$Var2, levels = c("0", "3", "1", "2", "4", "5"))

# Plot Confusion Matrix
ggplot(conf_df, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "Reference", y = "Prediction")

# Plot Overall Statistics
overall_stats <- data.frame(
  Metric = c("Accuracy", "Kappa", "Sensitivity", "Specificity"),
  Value = c(0.1804, 0.006, 0.14815, 0.88506)  # Replace with actual values
)

ggplot(overall_stats, aes(x = Metric, y = Value)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Overall Statistics", y = "Value")

The confusion matrix and statistics suggest that the model has low accuracy (0.1804). Sensitivity, specificity, and other metrics for each class are also provided. The Kappa value is close to zero, indicating poor agreement. McNemar’s Test has a very low p-value, suggesting a significant difference between predicted and actual values. Class-specific metrics show varied performance across different stress levels.

So i am still working on using other models so that there will be rise in accuracy. I am not sure but tried many ways but couldn’t improve it.In future I will try explore other datasets .

I tried using other dataset from the kaggle: #Student Stress Factors: A Comprehensive Analysis(Understanding the Underlying Causes and Their Impact on Today’s Students.) For this before exploring to my analysis I tried to answer the questions that are published by the auther of the dataset. I felt if i am able to answer those question then that would be helpful in analysing my analysis ##Reading Data(Importing Data)

# Assuming your file is named "students_mental_health_survey.csv"
file_path <- "StressLevelDataset.csv"

# Read the CSV file without showing column types
Student_Data_2 <- read_csv(file_path, show_col_types = FALSE)

# Print information about the data
print(Student_Data_2)
## # A tibble: 1,100 × 21
##    anxiety_level self_esteem mental_health_history depression headache
##            <dbl>       <dbl>                 <dbl>      <dbl>    <dbl>
##  1            14          20                     0         11        2
##  2            15           8                     1         15        5
##  3            12          18                     1         14        2
##  4            16          12                     1         15        4
##  5            16          28                     0          7        2
##  6            20          13                     1         21        3
##  7             4          26                     0          6        1
##  8            17           3                     1         22        4
##  9            13          22                     1         12        3
## 10             6           8                     0         27        4
## # ℹ 1,090 more rows
## # ℹ 16 more variables: blood_pressure <dbl>, sleep_quality <dbl>,
## #   breathing_problem <dbl>, noise_level <dbl>, living_conditions <dbl>,
## #   safety <dbl>, basic_needs <dbl>, academic_performance <dbl>,
## #   study_load <dbl>, teacher_student_relationship <dbl>,
## #   future_career_concerns <dbl>, social_support <dbl>, peer_pressure <dbl>,
## #   extracurricular_activities <dbl>, bullying <dbl>, stress_level <dbl>
# Print the number of rows and columns
cat("Number of rows:", nrow(Student_Data_2), "\n")
## Number of rows: 1100
cat("Number of columns:", ncol(Student_Data_2), "\n")
## Number of columns: 21

Overview of dataset:

Anxiety : range from 0 to 21, Measure : GAD-7 Self-esteem : range 0 to 30, Measure: Rosenberg Self Esteem Scale Mental Health History : 0 if no mental health history, 1 if mental health history Depression : range 0 to 27, Measure: Patient Health Questionnaire (PHQ-9) Other features mostly range from 0 to 5 considering 0,1 to be low, 2,3 to be mid, and 4,5 to be high. Blood pressure: range 1 to 3 (There was no direct explanation for this one, I assume 1 is low, 2 is normal, 3 is high) Stress level: range 0 to 2 (Again no explanation, I assume, 0 low level/absent, 1 medium level, 2 high level) *Generally lower score is better, but for others it is hard to decide. For example sleep quality, we can have an educated guess if we do some data analysis.

//Taken from the kaggle discussion about the dataset as I was short of time tried to learn from the discussion. (https://www.kaggle.com/datasets/rxnach/student-stress-factors-a-comprehensive-analysis)

This dataset contains around 20 features that create the most impact on the Stress of a Student. The features are selected scientifically considering 5 major factors, they are Psychological, Physiological, Social, Environmental, and Academic Factors. Some of them are: Psychological Factors => ‘anxiety_level’, ‘self_esteem’, ‘mental_health_history’, ‘depression’, Physiological Factors => ‘headache’, ‘blood_pressure’, ‘sleep_quality’, ‘breathing_problem Environmental Factors => ’noise_level’, ‘living_conditions’, ‘safety’, ‘basic_needs’, Academic Factors => ‘academic_performance’, ‘study_load’, ‘teacher_student_relationship’, ‘future_career_concerns’, Social Factor => ‘social_support’, ‘peer_pressure’, ‘extracurricular_activities’, ‘bullying’

Trying to answer the following questions:

##Descriptive Statistics:

  1. How many students are in the dataset?
number_of_students <- nrow(Student_Data_2)

# Print the result
print(number_of_students)
## [1] 1100

There are total 1100 students in this dataset.

  1. What is the average anxiety level of students in the dataset?
# Replace 'anxiety_level' with the actual column name for anxiety level
average_anxiety <- mean(Student_Data_2$anxiety_level, na.rm = TRUE)

# Print the result
print(average_anxiety)
## [1] 11.06364

Average Anxiety level is 11.06

  1. How many students have reported a history of mental health issues?
# Replace 'mental_health_history' with the actual column name for mental health history
students_with_mental_health_history <- sum(Student_Data_2$mental_health_history == 1, na.rm = TRUE)

# Print the result
print(students_with_mental_health_history)
## [1] 542

542 students have reported a history of mental health issues

##Psychological Factors:

  1. How many students have a self-esteem level below the average?
# Replace 'self_esteem' with the actual column name for self-esteem
self_esteem_average <- mean(Student_Data_2$self_esteem, na.rm = TRUE)

# Count the number of students with self-esteem below the average
students_below_average_self_esteem <- sum(Student_Data_2$self_esteem < self_esteem_average, na.rm = TRUE)

# Print the result
print(students_below_average_self_esteem)
## [1] 507

507 students have a self-esteem level below the average?

  1. What percentage of students have reported experiencing depression?
# Replace 'depression' with the actual column name for depression
students_with_depression <- sum(Student_Data_2$depression == 1, na.rm = TRUE)

# Total number of students
total_students <- nrow(Student_Data_2)

# Calculate the percentage
percentage_with_depression <- (students_with_depression / total_students) * 100

# Print the result
print(percentage_with_depression)
## [1] 3.272727

3.27 is the percentage of students have reported experiencing depression

##Physiological Factors:

  1. How many students experience headaches frequently?
students_with_frequent_headaches <- sum(Student_Data_2$headache == 5, na.rm = TRUE)

# Print the result
print(students_with_frequent_headaches)
## [1] 129

Here the frequency count is taken as 5 . There are 129 students experiencing headaches frequently.

  1. What is the average blood pressure reading among the students?
average_blood_pressure <- mean(Student_Data_2$blood_pressure, na.rm = TRUE)
# Print the result
print(average_blood_pressure)
## [1] 2.181818

Average blood pressure reading among the students is 2.18

  1. How many students rate their sleep quality as poor?
# Replace 1 with the value that indicates poor sleep quality in your dataset
students_with_poor_sleep_quality <- sum(Student_Data_2$sleep_quality == 1, na.rm = TRUE)

# Print the result
print(students_with_poor_sleep_quality)
## [1] 328

Here I consider 1 as the least sleep quality measure. Students rate their sleep quality as poor are 328

##Environmental Factors:

  1. How many students live in conditions with high noise levels?
students_with_high_noise_levels <- sum(Student_Data_2$noise_level == 5, na.rm = TRUE)

# Print the result
print(students_with_high_noise_levels)
## [1] 137

137 Students live in conditions with high noise level

  1. What percentage of students feel unsafe in their living conditions?
# Count the occurrences of students feeling unsafe
unsafe_students_count <- sum(Student_Data_2$safety == 5, na.rm = TRUE)

# Total number of students
total_students <- nrow(Student_Data_2)

# Calculate the percentage
percentage_feeling_unsafe <- (unsafe_students_count / total_students) * 100
print(percentage_feeling_unsafe)
## [1] 15.54545

15.54% of students feel unsafe in their living conditions

  1. How many students have reported not having their basic needs met?
unmet_basic_needs_count <- sum(Student_Data_2$basic_needs == 5, na.rm = TRUE)

# Print the result
print(unmet_basic_needs_count)
## [1] 189

189 students have reported not having their basic needs met

##Academic Factors:

  1. How many students rate their academic performance as below average?
below_average_performance_count <- sum(Student_Data_2$academic_performance < 2, na.rm = TRUE)

# Print the result
print(below_average_performance_count)
## [1] 213

As the academic performance range from 1-5 thought 2 can be considered as average. Students rate their academic performance as below average are 213

  1. What is the average study load reported by students?
average_study_load <- mean(Student_Data_2$study_load, na.rm = TRUE)

# Print the result
print(average_study_load)
## [1] 2.621818

Average study load reported by students is 2.62

  1. How many students have concerns about their future careers?
career_concerns_count <- sum(Student_Data_2$future_career_concerns == 3, na.rm = TRUE)

# Print the result
print(career_concerns_count)
## [1] 173

There are 173 students those concerns about their future career

##Social Factors:

  1. How many students feel they have strong social support?
strong_social_support_count <- sum(Student_Data_2$social_support >= 3, na.rm = TRUE)

# Print the result
print(strong_social_support_count)
## [1] 458

I considered 3 as the strong social support. So, there are 458 students feel they have strong social support

  1. What percentage of students have experienced bullying?
# Count the occurrences of students experiencing bullying
bullying_count <- sum(Student_Data_2$bullying == 5, na.rm = TRUE)

# Total number of students
total_students <- nrow(Student_Data_2)

# Calculate the percentage
percentage_experiencing_bullying <- (bullying_count / total_students) * 100

# Print the result
print(percentage_experiencing_bullying)
## [1] 15.72727

Assuming a value of 5 indicates experiencing bullying.15.72% of students have experienced bullying.

  1. How many students participate in extracurricular activities?
extracurricular_participation_count <- sum(Student_Data_2$extracurricular_activities == 1, na.rm = TRUE)

# Print the result
print(extracurricular_participation_count)
## [1] 184

Assuming a value of 1 indicates participation in extracurricular activities,184 students participate in extracurricular activities

##Comparative Analysis:

  1. Is there a correlation between anxiety level and academic performance?
# Calculate the correlation coefficient
correlation_coefficient <- cor(Student_Data_2$anxiety_level, Student_Data_2$academic_performance, use = "complete.obs")

# Print the result
print(correlation_coefficient)
## [1] -0.6496011

The correlation coefficient of -0.6496011 suggests a moderately strong negative correlation between anxiety levels and academic performance in the dataset.

  1. Do students with poor sleep quality also report higher levels of depression?
correlation_coefficient <- cor(Student_Data_2$sleep_quality, Student_Data_2$depression, use = "complete.obs")

# Print the result
print(correlation_coefficient)
## [1] -0.6931609
ggplot(Student_Data_2, aes(x = sleep_quality, y = depression)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Scatter Plot with Regression Line: Sleep Quality vs. Depression Levels", x = "Sleep Quality", y = "Depression Levels")
## `geom_smooth()` using formula = 'y ~ x'

The correlation coefficient of -0.69 indicates a moderately strong negative relationship between sleep quality and depression levels. As sleep quality decreases, depression levels tend to increase. This suggests a notable association between these variables in the dataset.

  1. Are students who experience bullying more likely to have a history of mental health issues?
contingency_table <- table(Student_Data_2$bullying, Student_Data_2$mental_health_history)

contingency_df <- as.data.frame.matrix(contingency_table)

# Rename the columns for better visualization
colnames(contingency_df) <- c("No Mental Health History", "History of Mental Health Issues")

# Create a stacked bar chart
ggplot(contingency_df, aes(x = rownames(contingency_df), y = `History of Mental Health Issues`, fill = `No Mental Health History`)) +
  geom_bar(stat = "identity") +
  labs(title = "Bullying vs. Mental Health History",
       x = "Bullying",
       y = "Count",
       fill = "Mental Health History") +
  theme_minimal()

Performing Chi-square Test

# Perform a chi-squared test for independence
chi_squared_test <- chisq.test(contingency_table)

# Print the result
print(chi_squared_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 532, df = 5, p-value < 2.2e-16

It appears that students who experience bullying are more likely to have a history of mental health issues based on the data it is according to the test bu from the graph I see no such thing.

##General Exploration:

  1. Which factor (Psychological, Physiological, Environmental, Academic, Social) has the highest number of students reporting negative experiences or conditions?
# Create a summary variable for each factor
psychological_sum <- rowSums(Student_Data_2[, c("self_esteem", "mental_health_history", "depression")])
physiological_sum <- rowSums(Student_Data_2[, c("headache", "blood_pressure", "sleep_quality", "breathing_problem")])
environmental_sum <- rowSums(Student_Data_2[, c("noise_level", "living_conditions", "safety", "basic_needs")])
academic_sum <- rowSums(Student_Data_2[, c("academic_performance", "study_load", "teacher_student_relationship", "future_career_concerns")])
social_sum <- rowSums(Student_Data_2[, c("social_support", "peer_pressure", "extracurricular_activities", "bullying")])

# Create a data frame to store the sums
sums_data <- data.frame(
  Factor = c("Psychological", "Physiological", "Environmental", "Academic", "Social"),
  Total_Count = c(sum(psychological_sum), sum(physiological_sum), sum(environmental_sum), sum(academic_sum), sum(social_sum))
)


ggplot(sums_data, aes(x = Factor, y = Total_Count, fill = Factor)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Counts of Negative Experiences or Conditions by Factor",
       x = "Factor",
       y = "Total Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate relative proportions
sums_data$Relative_Proportion <- sums_data$Total_Count / sum(sums_data$Total_Count)

# Print the updated sums_data
print(sums_data)
##          Factor Total_Count Relative_Proportion
## 1 Psychological       33908           0.4263602
## 2 Physiological       11114           0.1397478
## 3 Environmental       11745           0.1476820
## 4      Academic       11761           0.1478832
## 5        Social       11001           0.1383269
# Create a pie chart
library(ggplot2)
ggplot(sums_data, aes(x = "", y = Relative_Proportion, fill = Factor)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Relative Proportions of Negative Experiences or Conditions by Factor")

Psychological has the highest number of students reporting negative experiences

  1. Are there any noticeable trends or patterns when comparing different factors?
# Assuming your_dataset is the name of your dataset

# Create a summary data frame
sums_data <- data.frame(
  Factor = c("Psychological", "Physiological", "Environmental", "Academic", "Social"),
  Total_Count = c(33908, 11114, 11745, 11761, 11001)
)

# Print the summary data
print(sums_data)
##          Factor Total_Count
## 1 Psychological       33908
## 2 Physiological       11114
## 3 Environmental       11745
## 4      Academic       11761
## 5        Social       11001
# Calculate relative proportions
sums_data$Relative_Proportion <- sums_data$Total_Count / sum(sums_data$Total_Count)

# Print the updated sums_data
print(sums_data)
##          Factor Total_Count Relative_Proportion
## 1 Psychological       33908           0.4263602
## 2 Physiological       11114           0.1397478
## 3 Environmental       11745           0.1476820
## 4      Academic       11761           0.1478832
## 5        Social       11001           0.1383269
# Visualize relative proportions with a pie chart
library(ggplot2)
ggplot(sums_data, aes(x = "", y = Relative_Proportion, fill = Factor)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Relative Proportions of Negative Experiences or Conditions by Factor")

# Perform a chi-squared test for independence
contingency_table <- matrix(sums_data$Total_Count, nrow = 1, byrow = TRUE)
chi_squared_test <- chisq.test(contingency_table)
print(chi_squared_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  contingency_table
## X-squared = 25499, df = 4, p-value < 2.2e-16
  1. Which specific feature within each factor has the most significant impact on student stress, based on the dataset?
# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure", 
                   "sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety", 
                   "basic_needs", "academic_performance", "study_load", "teacher_student_relationship", 
                   "future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities", 
                   "bullying", "stress_level")

# Subset the dataset
subset_data <- Student_Data_2 %>%
  select(all_of(selected_cols))

# Melt the data for analysis
melted_data <- subset_data %>%
  gather(key = "Feature", value = "Value", -stress_level)

# Fit linear models for each feature
model_results <- lm(stress_level ~ Value * Feature, data = melted_data)

# Summarize model results
summary_results <- broom::tidy(model_results)

# Plot coefficients and confidence intervals
ggplot(summary_results, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2) +
  labs(title = "Impact of Features on Stress Level",
       x = "Feature",
       y = "Coefficient") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Using Random Forest to find which factors have more influence

# Assuming your dataset is named 'your_dataset'

# Load necessary libraries
library(randomForest)
library(ggplot2)

# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure", 
                   "sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety", 
                   "basic_needs", "academic_performance", "study_load", "teacher_student_relationship", 
                   "future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities", 
                   "bullying", "stress_level")

# Subset the dataset
subset_data <- Student_Data_2 %>%
  select(all_of(selected_cols))

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
train_indices <- sample(1:nrow(subset_data), 0.8 * nrow(subset_data))
train_data <- subset_data[train_indices, ]
test_data <- subset_data[-train_indices, ]

# Fit a random forest model
rf_model <- randomForest(stress_level ~ ., data = train_data, ntree = 500)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
# Print variable importance
print(rf_model$importance)
##                              IncNodePurity
## self_esteem                      68.454547
## mental_health_history             2.531924
## depression                       44.165072
## headache                         53.925285
## blood_pressure                    4.706580
## sleep_quality                    85.688590
## breathing_problem                 7.072772
## noise_level                      12.776804
## living_conditions                 5.785769
## safety                           28.267338
## basic_needs                      40.029329
## academic_performance             32.189123
## study_load                        9.066602
## teacher_student_relationship     47.989078
## future_career_concerns           29.924651
## social_support                    8.959359
## peer_pressure                    20.796274
## extracurricular_activities       36.588251
## bullying                         43.549378

I guess this dataset is giving expected results that supports my assumptions.As it shows:Self Esteem, Depression,Heaache,Sleep Quality,Teacher Student Relationship, Extracurrricular Activities and bullying are the factors that affect the stress level ##Visualizing the factors vs stress

# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure", 
                   "sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety", 
                   "basic_needs", "academic_performance", "study_load", "teacher_student_relationship", 
                   "future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities", 
                   "bullying", "stress_level")

# Subset the dataset
subset_data <- Student_Data_2 %>%
  select(all_of(selected_cols))

# Function to generate plots with different themes and colors
generate_plots <- function(col, color) {
  list(
    scatter = ggplot(subset_data, aes(x = .data[[col]], y = stress_level, fill = .data[[col]])) +
      geom_point() +
      labs(title = paste("Scatter Plot of", col, "against Stress Level"),
           x = col,
           y = "Stress Level") +
      theme_minimal(),
    
    boxplot = ggplot(subset_data, aes(x = 1, y = .data[[col]], fill = .data[[col]])) +
      geom_boxplot() +
      labs(title = paste("Box Plot of", col),
           x = NULL,
           y = col) +
      theme_minimal(),
    
    violin = ggplot(subset_data, aes(x = 1, y = .data[[col]], fill = .data[[col]])) +
      geom_violin() +
      labs(title = paste("Violin Plot of", col),
           x = NULL,
           y = col) +
      theme_minimal()
  )
}

# Create different types of plots with different colors
for (i in seq_along(selected_cols)) {
  col <- selected_cols[i]
  color <- rainbow(length(selected_cols))[i]
  plot_list <- generate_plots(col, color)
  
  # Save or display the plots
  lapply(plot_list, print)
}

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

##Tidying the data

# Replace empty strings with NA
Student_Data_2[Student_Data_2 == ""] <- NA

# Remove rows with missing values
Student_Data_without_missing_data_2<- na.omit(Student_Data_2)

# Check the dimensions of the cleaned data
dim(Student_Data_without_missing_data_2)
## [1] 1100   21

There are no missing values in this dataset For the time being I am now drawing LM against stress_level so that there could be good visualization

# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure", 
                   "sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety", 
                   "basic_needs", "academic_performance", "study_load", "teacher_student_relationship", 
                   "future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities", 
                   "bullying", "stress_level")

# Subset the dataset
subset_data <-Student_Data_without_missing_data_2%>%
  select(all_of(selected_cols))

# Create linear regression plots for each feature
for (col in selected_cols) {
  if (col != "stress_level") {
    lm_plot <- ggplot(subset_data, aes(x = .data[[col]], y = stress_level)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +
      labs(title = paste("Linear Regression Plot of", col, "against Stress Level"),
           x = col,
           y = "Stress Level") +
      theme_minimal()
    
    print(lm_plot)
  }
}
## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

It could be seen that there is positive lm stress_level against these following featured:Mental Health History,Depression, Headache, Blood Pressure, Breathing Problem, Noise Level, Study Load, Peer Pressure, Extracurricular Activities, Bullying, Future Career Concerns, And these show negative relation: Self esteem, Sleep Quality, Living Conditions, Safety, Basic Needs, Academic Performance, Teacher Student Relationship, Social Support

##Modeling

# Step 1: Split the data into training, testing, and validation sets
set.seed(123)  # For reproducibility
index <- createDataPartition(Student_Data_2$stress_level, p = 0.7, list = FALSE)
train_data <- Student_Data_2[index, ]
test_validation_data <- Student_Data_2[-index, ]

# Further split test_validation_data into testing and validation sets
index_test <- createDataPartition(test_validation_data$stress_level, p = 0.5, list = FALSE)
test_data <- test_validation_data[index_test, ]
validation_data <- test_validation_data[-index_test, ]

# Convert stress_level to a factor
train_data$stress_level <- as.factor(train_data$stress_level)
test_data$stress_level <- as.factor(test_data$stress_level)


# Step 2: Train a machine learning model (Random Forest in this example)
rf_model <- randomForest(stress_level ~ ., data = train_data)

# Step 3: Make predictions on the testing set
predictions <- predict(rf_model, newdata = test_data)
# Check factor levels in the training dataset
levels_train <- levels(train_data$stress_level)

# Check factor levels in the testing dataset
levels_test <- levels(test_data$stress_level)

# Compare levels
print(levels_train)
## [1] "0" "1" "2"
print(levels_test)
## [1] "0" "1" "2"
# ... (previous code remains unchanged)

# Step 4: Visualize confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$stress_level)

# Extract confusion matrix as a table
conf_table <- as.table(conf_matrix)

heatmap(conf_table, col = terrain.colors(10), scale = "column",
        main = "Confusion Matrix", xlab = "Predicted", ylab = "Actual")

# Calculate class accuracies
class_accuracies <- conf_matrix$byClass

# Convert class_accuracies to a data frame
class_accuracies <- as.data.frame(class_accuracies)

# Print the column names
column_names <- names(class_accuracies)
print(column_names)
##  [1] "Sensitivity"          "Specificity"          "Pos Pred Value"      
##  [4] "Neg Pred Value"       "Precision"            "Recall"              
##  [7] "F1"                   "Prevalence"           "Detection Rate"      
## [10] "Detection Prevalence" "Balanced Accuracy"
# Melt the data for better visualization
melted_accuracies <- reshape2::melt(class_accuracies)
## No id variables; using all as measure variables
# Create a bar plot
ggplot(melted_accuracies, aes(x = variable, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Class Accuracies Bar Plot",
       x = "Metric",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Thankfully this model has preety good accuracy of more then 75% . I’m very glad Dr.Swan helped in achieving this goal. From his suggestions I went ahead and did my analysis using other dataset and this dataset supported my assumptions.